This article investigates why transformer models struggle with multi-digit multiplication despite their otherwise advanced capabilities. By reverse-engineering a trained model, the authors find that the architecture is capable of encoding the necessary long-range dependencies, yet standard training converges to a local optimum that lacks them; introducing an auxiliary loss can help the model learn the task effectively.
Tags: transformers, multiplication, long-range dependencies
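
Below is a minimal sketch of the kind of training objective the summary points to: a standard next-token loss combined with an auxiliary loss on intermediate quantities, so that the hidden states are pushed to encode the long-range dependencies the task requires. Everything here is an assumption for illustration rather than the paper's actual setup; `TinyDecoder`, `aux_head`, `AUX_WEIGHT`, and the choice of auxiliary targets (for example, digits of a running partial sum) are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration only.
VOCAB_SIZE = 14          # digits 0-9 plus a few special tokens
AUX_VOCAB_SIZE = 10      # e.g. digits of an intermediate quantity such as a running sum
D_MODEL = 128
AUX_WEIGHT = 0.5         # hypothetical mixing coefficient for the auxiliary term


class TinyDecoder(nn.Module):
    """A stand-in model that returns both logits and final hidden states."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        hidden = self.encoder(self.embed(tokens))
        return self.lm_head(hidden), hidden


aux_head = nn.Linear(D_MODEL, AUX_VOCAB_SIZE)  # predicts the intermediate values


def training_loss(model, tokens, targets, aux_targets):
    logits, hidden = model(tokens)
    # Standard next-token cross-entropy on the product digits.
    main_loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    # Auxiliary supervision pushes the hidden states to encode the long-range
    # intermediate quantities, instead of letting training settle into a local
    # optimum that ignores them.
    aux_logits = aux_head(hidden)
    aux_loss = F.cross_entropy(aux_logits.reshape(-1, AUX_VOCAB_SIZE), aux_targets.reshape(-1))
    return main_loss + AUX_WEIGHT * aux_loss


# Toy usage with random data, just to show the shapes involved.
model = TinyDecoder()
tokens = torch.randint(0, VOCAB_SIZE, (8, 16))
targets = torch.randint(0, VOCAB_SIZE, (8, 16))
aux_targets = torch.randint(0, AUX_VOCAB_SIZE, (8, 16))
loss = training_loss(model, tokens, targets, aux_targets)
loss.backward()
```

The design choice being illustrated is only the loss mixing: the auxiliary head reads the same hidden states used for next-token prediction, so gradients from the auxiliary targets shape the representations that the main task relies on.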